{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Understanding text  (and handling it in Python)\n",
    "###  Fiona Pigott\n",
    "\n",
    "A Python notebook to teach you everything you never wanted to know about text encoding (specifically ASCII, UTF-8, and the difference therein but we'll explain what some others mean).\n",
    "\n",
    "Credit to these sites for a helpful description of different file encodings:\n",
    "- https://en.wikipedia.org/wiki/UTF-8\n",
    "- http://stackoverflow.com/questions/700187/unicode-utf-ascii-ansi-format-differences\n",
    "- http://csharpindepth.com/Articles/General/Unicode.aspx\n",
    "\n",
    "And to these pages for a better understanding of emoji specifically:\n",
    "- http://apps.timwhitlock.info/emoji/tables/unicode\n",
    "- http://unicode.org/charts/PDF/UFE00.pdf\n",
    "- http://www.unicode.org/Public/emoji/2.0/emoji-data.txt (a list of all of the official emoji)\n",
    "\n",
    "And if you got here from Data-Science-45-min intros, check out https://github.com/fionapigott/emoji-counter for this tutorial and (a little) more.\n",
    "\n",
    "\n",
    "-------------------------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 0: What do you mean by \"text encoding\"?\n",
    "----------------------\n",
    "\n",
    "A text encoding is a scheme that allows us to convert between binary (stored on your computer) and a character that you can display and make sense of. A text encoding does *not* define a font.\n",
    "\n",
    "When I say \"character\" I mean \"unicode code point.\" Code point -> character is a 1->1 mapping of meaning. A font just decides how to display that character. Each emoji has a code point assigned to it by the Unicode Consortium, and \"GRINNING FACE WITH SMILING EYES\" should be a grinning face with smiley eyes on any platform. Windows Wingdigs, if you remember that regrettable period, is a font.\n",
    "\n",
    "I'm going to use \"code point\" and \"character\" a little bit interchangeably. If you can get the code point represented by a string of bits, you can figure out what character it represents.\n",
    "\n",
    "**Decode = convert binary data to a code point**\n",
    "\n",
    "**Encode = convert a code point (a big number) to binary data that you can write somewhere**\n",
    "\n",
    "You code will always:\n",
    "- Ingest binary data (say, my_tweets.txt)\n",
    "- Decode that data into characters (*whether or not you have to type **decode**.*)\n",
    "- Encode that data so that you cna write it again (*whether or not you type **encode**.* You can't write \"128513\" to a single bit.)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 1: Text encodings are not magic\n",
    "-----------------\n",
    "\n",
    "### Get the actual data of the file in your terminal with xxd.\n",
    "\n",
    "I want to spend a few minutes convincing you of what I'm about to say about text encoding in Python. \n",
    "\n",
    "We're spending time in the terminal to make it painfully, horribly clear that text encoding/decoding is not some Python thing, but rather exactly what every text-display program does every time you convert some binary (stored on your computer) to something that you can read.\n",
    "\n",
    "### ASCII\n",
    "ASCII is a character-encoding scheme where each character fits in exactly 1 byte--8 bits. ASCII, however, uses *only* the bottom 7 bits of an 8-bit byte, and thus can take only 2^7 (128) values. The value of the byte that encodes a character is exactly that character's code point.\n",
    "\n",
    "\"h\" and \"i\" are both ascii characters--they fit in one byte in the ascii encoding scheme. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "hi\n",
      "0000000: 68 69                                            hi\n",
      "0000000: 01101000 01101001                                      hi\n"
     ]
    }
   ],
   "source": [
    "!printf \"hi\\n\"\n",
    "!printf \"hi\" | xxd -g1\n",
    "!printf \"hi\" | xxd -b -g1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0 -> u'\\x00'->'\u0000'\n",
      "1 -> u'\\x01'->'\u0001'\n",
      "2 -> u'\\x02'->'\u0002'\n",
      "3 -> u'\\x03'->'\u0003'\n",
      "4 -> u'\\x04'->'\u0004'\n",
      "5 -> u'\\x05'->'\u0005'\n",
      "6 -> u'\\x06'->'\u0006'\n",
      "7 -> u'\\x07'->'\u0007'\n",
      "8 -> u'\\x08'->'\b'\n",
      "9 -> u'\\t'->'\t'\n",
      "10 -> u'\\n'->'\n",
      "'\n",
      "11 -> u'\\x0b'->'\u000b",
      "'\n",
      "12 -> u'\\x0c'->'\f",
      "'\n",
      "13 -> u'\\r'->'\r",
      "'\n",
      "14 -> u'\\x0e'->'\u000e'\n",
      "15 -> u'\\x0f'->'\u000f'\n",
      "16 -> u'\\x10'->'\u0010'\n",
      "17 -> u'\\x11'->'\u0011'\n",
      "18 -> u'\\x12'->'\u0012'\n",
      "19 -> u'\\x13'->'\u0013'\n",
      "20 -> u'\\x14'->'\u0014'\n",
      "21 -> u'\\x15'->'\u0015'\n",
      "22 -> u'\\x16'->'\u0016'\n",
      "23 -> u'\\x17'->'\u0017'\n",
      "24 -> u'\\x18'->'\u0018'\n",
      "25 -> u'\\x19'->'\u0019'\n",
      "26 -> u'\\x1a'->'\u001a'\n",
      "27 -> u'\\x1b'->'\u001b'\n",
      "28 -> u'\\x1c'->'\u001c",
      "'\n",
      "29 -> u'\\x1d'->'\u001d",
      "'\n",
      "30 -> u'\\x1e'->'\u001e",
      "'\n",
      "31 -> u'\\x1f'->'\u001f'\n",
      "32 -> u' '->' '\n",
      "33 -> u'!'->'!'\n",
      "34 -> u'\"'->'\"'\n",
      "35 -> u'#'->'#'\n",
      "36 -> u'$'->'$'\n",
      "37 -> u'%'->'%'\n",
      "38 -> u'&'->'&'\n",
      "39 -> u\"'\"->'''\n",
      "40 -> u'('->'('\n",
      "41 -> u')'->')'\n",
      "42 -> u'*'->'*'\n",
      "43 -> u'+'->'+'\n",
      "44 -> u','->','\n",
      "45 -> u'-'->'-'\n",
      "46 -> u'.'->'.'\n",
      "47 -> u'/'->'/'\n",
      "48 -> u'0'->'0'\n",
      "49 -> u'1'->'1'\n",
      "50 -> u'2'->'2'\n",
      "51 -> u'3'->'3'\n",
      "52 -> u'4'->'4'\n",
      "53 -> u'5'->'5'\n",
      "54 -> u'6'->'6'\n",
      "55 -> u'7'->'7'\n",
      "56 -> u'8'->'8'\n",
      "57 -> u'9'->'9'\n",
      "58 -> u':'->':'\n",
      "59 -> u';'->';'\n",
      "60 -> u'<'->'<'\n",
      "61 -> u'='->'='\n",
      "62 -> u'>'->'>'\n",
      "63 -> u'?'->'?'\n",
      "64 -> u'@'->'@'\n",
      "65 -> u'A'->'A'\n",
      "66 -> u'B'->'B'\n",
      "67 -> u'C'->'C'\n",
      "68 -> u'D'->'D'\n",
      "69 -> u'E'->'E'\n",
      "70 -> u'F'->'F'\n",
      "71 -> u'G'->'G'\n",
      "72 -> u'H'->'H'\n",
      "73 -> u'I'->'I'\n",
      "74 -> u'J'->'J'\n",
      "75 -> u'K'->'K'\n",
      "76 -> u'L'->'L'\n",
      "77 -> u'M'->'M'\n",
      "78 -> u'N'->'N'\n",
      "79 -> u'O'->'O'\n",
      "80 -> u'P'->'P'\n",
      "81 -> u'Q'->'Q'\n",
      "82 -> u'R'->'R'\n",
      "83 -> u'S'->'S'\n",
      "84 -> u'T'->'T'\n",
      "85 -> u'U'->'U'\n",
      "86 -> u'V'->'V'\n",
      "87 -> u'W'->'W'\n",
      "88 -> u'X'->'X'\n",
      "89 -> u'Y'->'Y'\n",
      "90 -> u'Z'->'Z'\n",
      "91 -> u'['->'['\n",
      "92 -> u'\\\\'->'\\'\n",
      "93 -> u']'->']'\n",
      "94 -> u'^'->'^'\n",
      "95 -> u'_'->'_'\n",
      "96 -> u'`'->'`'\n",
      "97 -> u'a'->'a'\n",
      "98 -> u'b'->'b'\n",
      "99 -> u'c'->'c'\n",
      "100 -> u'd'->'d'\n",
      "101 -> u'e'->'e'\n",
      "102 -> u'f'->'f'\n",
      "103 -> u'g'->'g'\n",
      "104 -> u'h'->'h'\n",
      "105 -> u'i'->'i'\n",
      "106 -> u'j'->'j'\n",
      "107 -> u'k'->'k'\n",
      "108 -> u'l'->'l'\n",
      "109 -> u'm'->'m'\n",
      "110 -> u'n'->'n'\n",
      "111 -> u'o'->'o'\n",
      "112 -> u'p'->'p'\n",
      "113 -> u'q'->'q'\n",
      "114 -> u'r'->'r'\n",
      "115 -> u's'->'s'\n",
      "116 -> u't'->'t'\n",
      "117 -> u'u'->'u'\n",
      "118 -> u'v'->'v'\n",
      "119 -> u'w'->'w'\n",
      "120 -> u'x'->'x'\n",
      "121 -> u'y'->'y'\n",
      "122 -> u'z'->'z'\n",
      "123 -> u'{'->'{'\n",
      "124 -> u'|'->'|'\n",
      "125 -> u'}'->'}'\n",
      "126 -> u'~'->'~'\n",
      "127 -> u'\\x7f'->''\n"
     ]
    }
   ],
   "source": [
    "# Generate a list of all of the ASCII characters:\n",
    "# 'unichr' is a built-in Python function to take a number to a unicode code point \n",
    "# (I'll talk more about this and some other built-ins later)\n",
    "for i in range(0,128):\n",
    "    print str(i) + \" -> \" + repr(unichr(i)) + \"->\" + \"'\" + unichr(i).encode(\"ascii\") + \"'\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "ename": "UnicodeEncodeError",
     "evalue": "'ascii' codec can't encode character u'\\x81' in position 0: ordinal not in range(128)",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mUnicodeEncodeError\u001b[0m                        Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-3-99e41bb4ce9c>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;31m# And if you try to use \"ascii\" encoding on a character whose value is too high:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      2\u001b[0m \u001b[0;31m# Hint: you've definitely seen this error before\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m \u001b[0munichr\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m129\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mencode\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"ascii\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[0;31mUnicodeEncodeError\u001b[0m: 'ascii' codec can't encode character u'\\x81' in position 0: ordinal not in range(128)"
     ]
    }
   ],
   "source": [
    "# And if you try to use \"ascii\" encoding on a character whose value is too high:\n",
    "# Hint: you've definitely seen this error before\n",
    "unichr(129).encode(\"ascii\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### UTF-8 (most commonly used multi-byte encoding, used by Twitter)\n",
    "You might be familiar with the concept of Huffman Coding (https://en.wikipedia.org/wiki/Huffman_coding). Huffamn coding is a way of losslessly compressing data by encoding the most common values with the least amount of information. A Huffman coding tree of the English language, might, for example, assign \"e\" a value of a single bit.\n",
    "\n",
    "UTF-8 encoding is similar to a Huffman encoding. ASCII-compatible characters are encoded exactly the same way (a file that is UTF-8 encoded but contains only the 128 ASCII-compatible characters is effectively ASCII encoded. This way, those common characters occupy only one byte. All furter characters are encoded in multiple bytes.\n",
    "\n",
    "The multibyte encoding scheme works like this:\n",
    "- The number of leading 1s in the first byte maps to the length of the character in bytes.\n",
    "- Each following character in the multibyte squence begins with '10'\n",
    "- The value of the unicode code point is encoded in all of the unused bits. That is, every bit that isn't either a leading '1' of the first byte, or a leading '10' of the following bytes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0b11110000\n",
      "Count the 1s at the beginning of the bit string: 4!\n"
     ]
    }
   ],
   "source": [
    "# Example: \\xf0 is the leading byte for a 4-character emoji:\n",
    "print bin(ord('\\xf0'))\n",
    "# And it has 4 1s!\n",
    "print \"Count the 1s at the beginning of the bit string: 4!\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now look at a multi-byte character, \"GRINNING FACE WITH SMILING EYES.\" This guy doesn't fit in a single byte. In fact, his encoding takes 4 bytes (http://apps.timwhitlock.info/unicode/inspect/hex/1F601)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "😁\n",
      "0000000: f0 9f 98 81                                      ....\n",
      "0000000: 11110000 10011111 10011000 10000001                    ....\n",
      "0000000: f0 9f 93 8a                                      ....\n",
      "0000000: e2 9d a4 ef b8 8f                                ......\n"
     ]
    }
   ],
   "source": [
    "!printf \"😁\\n\" \n",
    "!printf \"😁\" | xxd -g1\n",
    "!printf \"😁\" | xxd -b -g1\n",
    "# Figure out what some weird emoji is: \n",
    "# https://twitter.com/jrmontag/status/677621827410255872\n",
    "!printf \"📊\" | xxd -g1\n",
    "# https://twitter.com/SRKCHENNAIFC/status/677894680303017985\n",
    "!printf \"❤️\" | xxd -g1 "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is pretty much exactly what you get when you're looking at Tweet data. If you don't believe me, try:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "cat: test_tweet.json: No such file or directory\n"
     ]
    }
   ],
   "source": [
    "text = !cat test_tweet.json | xxd -g1 | grep \"f0 9f 98 81\"\n",
    "for line in text:\n",
    "    print line"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "ename": "ValueError",
     "evalue": "invalid literal for int() with base 16: 'cat: te'",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mValueError\u001b[0m                                Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-7-c87aadfda993>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;31m# position of the emoji in bytes:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mstart\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtext\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;36m7\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;36m16\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m      3\u001b[0m \u001b[0mend\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtext\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;36m7\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;36m16\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m+\u001b[0m \u001b[0;36m16\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      4\u001b[0m \u001b[0;32mprint\u001b[0m \u001b[0;34m\"Run the following to cat out just the first line of bytes from the hexdump:\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      5\u001b[0m \u001b[0;32mprint\u001b[0m \u001b[0;34m\"!head -c{} test_tweet.json | tail -c{}\"\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mend\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mend\u001b[0m\u001b[0;34m-\u001b[0m\u001b[0mstart\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;31mValueError\u001b[0m: invalid literal for int() with base 16: 'cat: te'"
     ]
    }
   ],
   "source": [
    "# position of the emoji in bytes:\n",
    "start = int(text[0][0:7],16)\n",
    "end = int(text[0][0:7],16) + 16\n",
    "print \"Run the following to cat out just the first line of bytes from the hexdump:\"\n",
    "print \"!head -c{} test_tweet.json | tail -c{}\".format(end, end-start)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Rolling your own UTF-8 decoder.\n",
    "\n",
    "This is going to be fun. And by 'fun' I mean \"why are you making us do this?\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0000000: 11110000 10011111 10011000 10000001                    ....\n",
      "['11110000', '10011111', '10011000', '10000001']\n"
     ]
    }
   ],
   "source": [
    "# Get the bits!\n",
    "!printf \"😁\" | xxd -b -g1\n",
    "# We're gonna use this in a minute\n",
    "byte_string_smiley = !printf \"😁\" | xxd -b -g1\n",
    "bytes = byte_string_smiley[0].split(\" \")[1:5]\n",
    "print bytes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The 1st byte: 11110000\n",
      "The character length in bytes, calculated using the 1st byte: 4\n",
      "The remaining bits in the first byte: 0000\n",
      "The non-'leading 10' bits in the next 3 bytes: ['011111', '011000', '000001']\n",
      "The bits of the code point: ['0000', '011111', '011000', '000001']\n",
      "The bit string of the code point: 0000011111011000000001\n",
      "The code point is: 128513 (or in hex 0x1f601)\n"
     ]
    },
    {
     "ename": "ValueError",
     "evalue": "unichr() arg not in range(0x10000) (narrow Python build)",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mValueError\u001b[0m                                Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-22-f2bf30f75b2f>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m     15\u001b[0m \u001b[0mcode_point_int\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcode_point_bits\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;36m2\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     16\u001b[0m \u001b[0;32mprint\u001b[0m \u001b[0;34m\"The code point is: {} (or in hex {})\"\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcode_point_int\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mhex\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcode_point_int\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 17\u001b[0;31m \u001b[0;32mprint\u001b[0m \u001b[0;34m\"And the character is: {}\"\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0munichr\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcode_point_int\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mencode\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"utf-8\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m     18\u001b[0m \u001b[0;32mprint\u001b[0m \u001b[0;34m\"Phew!\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;31mValueError\u001b[0m: unichr() arg not in range(0x10000) (narrow Python build)"
     ]
    }
   ],
   "source": [
    "first_byte = bytes[0]\n",
    "print \"The 1st byte: {}\".format(first_byte)\n",
    "length_of_char = 0\n",
    "b = 0\n",
    "while first_byte[b] == '1':\n",
    "    length_of_char += 1\n",
    "    b += 1\n",
    "print \"The character length in bytes, calculated using the 1st byte: {}\".format(length_of_char)\n",
    "print \"The remaining bits in the first byte: {}\".format(first_byte[b:])\n",
    "print \"The non-'leading 10' bits in the next 3 bytes: {}\".format([x[2:] for x in bytes[1:]])\n",
    "print \"The bits of the code point: {}\".format(\n",
    "    [first_byte[b:]]+[x[2:] for x in bytes[1:]])\n",
    "code_point_bits = \"\".join([first_byte[b:]]+[x[2:] for x in bytes[1:]])\n",
    "print \"The bit string of the code point: {}\".format(code_point_bits)\n",
    "code_point_int = int(code_point_bits,2)\n",
    "print \"The code point is: {} (or in hex {})\".format(code_point_int, hex(code_point_int))\n",
    "print \"And the character is: {}\".format(unichr(code_point_int).encode(\"utf-8\"))\n",
    "print \"Phew!\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 2: But what if I like magic?\n",
    "----------------------\n",
    "\n",
    "Getting Python (2) to help you out with this."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### The hard way:\n",
    "\n",
    "The following should demonstrate to you that what we're about to do is exactly the same as what we just did, but easier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "😁\n"
     ]
    },
    {
     "ename": "TypeError",
     "evalue": "ord() expected a character, but string of length 2 found",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mTypeError\u001b[0m                                 Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-23-5bd1108bf954>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m      9\u001b[0m \u001b[0mcode_point\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mtest_emoji\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdecode\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"utf-8\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     10\u001b[0m \u001b[0;32mprint\u001b[0m \u001b[0mcode_point\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 11\u001b[0;31m \u001b[0mcode_point_integer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mord\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcode_point\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m     12\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mbyte\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mtest_emoji\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     13\u001b[0m     \u001b[0mbytes\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mappend\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mbyte\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;31mTypeError\u001b[0m: ord() expected a character, but string of length 2 found"
     ]
    }
   ],
   "source": [
    "# The 'rb' option to open (or mode = 'rb' to fileinput.FileInput)\n",
    "# this means, \"read in the file as a byte string.\" Basically, exactly what you get from\n",
    "# the xxd hexdump\n",
    "f = open(\"test.txt\", 'rb')\n",
    "# read the file (the whole file is one emoji character)\n",
    "test_emoji = f.read().strip()\n",
    "bytes = []\n",
    "bits = []\n",
    "code_point = test_emoji.decode(\"utf-8\")\n",
    "print code_point\n",
    "code_point_integer = ord(code_point)\n",
    "for byte in test_emoji:\n",
    "    bytes.append(byte)\n",
    "    bits.append(bin(ord(byte)).lstrip(\"0b\"))\n",
    "print \"The Unicode code point: {}\".format([code_point])\n",
    "print \"Integer value of the unicode code point: hex: {}, decimal: {}\".format(\n",
    "    hex(code_point_integer), code_point_integer)\n",
    "print \"The bytes (hex): {}\".format(bytes)\n",
    "print \"The bytes (decimal): {}\".format([ord(x) for x in bytes])\n",
    "print \"Each byte represented in bits: {}\".format(bits)\n",
    "f.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### The easy way:\n",
    "\n",
    "Now, imagine that you didn't want to have to think about bit strings every time you dealt with text data. We live in that brave new world.\n",
    "\n",
    "The big problem that I (we, I think) have been having with emoji and multibyte charaters in general is decoding them in a way that allows us to process one character at a time. I had this problem because I didn't understand what the encoding/decoding steps meant."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "😁\r\n"
     ]
    }
   ],
   "source": [
    "!cat test.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "g = open(\"test.txt\")\n",
    "# read the file (the whole file is one emoji character)\n",
    "test_emoji = g.read().strip()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "list(test_emoji)\n",
      "['\\xf0', '\\x9f', '\\x98', '\\x81']\n"
     ]
    }
   ],
   "source": [
    "# Now, try to get a list of characters\n",
    "print \"list(test_emoji)\"\n",
    "print list(test_emoji)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Just asking for a list of all of the characters doesn't work, because Python 2 assumes ASCII (1 byte per character) and splits it up appropriately. \n",
    "We'd have to search all of the bytes to figure out which ones constituted emoji.\n",
    "\n",
    "I've implemented this, https://github.com/fionapigott/emoji-counter, because I didn't realize that there was a better way. But there is!\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[u'\\ud83d', u'\\ude01']\n"
     ]
    },
    {
     "ename": "error",
     "evalue": "unpack requires a string argument of length 4",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31merror\u001b[0m                                     Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-52-4ffa45bdfcad>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m      3\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      4\u001b[0m \u001b[0;32mprint\u001b[0m \u001b[0mlist\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtest_emoji\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdecode\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'utf-8'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 5\u001b[0;31m \u001b[0;32mprint\u001b[0m \u001b[0mstruct\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0munpack\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"i\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtest_emoji\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdecode\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'utf-8'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m      6\u001b[0m \u001b[0;31m#.decode('utf-32')\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      7\u001b[0m \u001b[0;32mprint\u001b[0m \u001b[0;34m\"list(test_emoji.decode('utf-8'))\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;31merror\u001b[0m: unpack requires a string argument of length 4"
     ]
    }
   ],
   "source": [
    "# *Now*, try to get a list of characters\n",
    "import struct\n",
    "\n",
    "print list(test_emoji.decode('utf-8'))\n",
    "print struct.unpack(\"i\", test_emoji.decode('utf-8'))\n",
    "#.decode('utf-32')\n",
    "print \"list(test_emoji.decode('utf-8'))\"\n",
    "print list(test_emoji.decode('utf-8'))\n",
    "print list(test_emoji.decode('utf-8'))[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now if you want to search your code for \"😁\", you just need to know its code point (which you can find or even, if you're rather determined, derive)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "u'\\U0001f4ca'"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Get the code point for this weird emoji\n",
    "\"📊\".decode(\"utf-8\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Appendices"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## A word on other encodings, with a tiny example\n",
    "\n",
    "There are many other encodings, such as ISO-8859-1, UTF-16, UTF-32 etc, which are less commonly used on the web, and for the most part, don't worry about them. They represent a variety of other ways to mape bytes -> code points and back again.\n",
    "\n",
    "I want to show one quick example of the UTF-32 encoding, which simply assigns 1 code point per 4-byte block. I'm going to show the encoding/decoding in Python, write the encoded data to a file, and read it back.\n",
    "\n",
    "I'm not showing this becuase UTF-32 is special or because you should use it. I'm showing it so you understand a little about how to work with other file encodings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print \"😁\"\n",
    "# Remember, and this is a bit hard: that thing we just printed was encoded at UTF-8 \n",
    "# (that's why Chrome renders it at all)\n",
    "print repr(\"😁\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Get the code point, so that we can encode it again with a different scheme\n",
    "code_point = \"😁\".decode(\"utf-8\")\n",
    "# You have to print the repr() to look at the code point value, \n",
    "# otherwise 'print' will automatically encode the character to print it\n",
    "print repr(code_point)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Now encode the data as UTF32\n",
    "utf32_smiley = code_point.encode(\"utf-32\")\n",
    "print repr(utf32_smiley)\n",
    "print \"The first 4 bytes means 'this file is UTF-32 encoded'. The next 4 are the character.\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# That's a byte string--we can write it to a file\n",
    "utf32_file = open(\"test_utf32.txt\",\"w\")\n",
    "utf32_file.write(utf32_smiley)\n",
    "utf32_file.close()\n",
    "# No nasty Encode errors. That's good."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Butttt, that file looks like garbage, because nothing is going to automatically\n",
    "# decode that byte string as UTF-32\n",
    "!cat test_utf32.txt \n",
    "print \"\\n\"\n",
    "# We can still look at the bytes tho! And they should look familiar\n",
    "!cat test_utf32.txt | xxd -g1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# And we can read in the file as long as we use the right decoder\n",
    "utf32_file_2 = open(\"test_utf32.txt\",\"rb\")\n",
    "code_point_back_again = utf32_file_2.read().decode(\"utf-32\")\n",
    "print code_point_back_again"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Just when you thought you knew everything about Emoji\n",
    "It's worse than it seems! Well, just a little worse.\n",
    "\n",
    "One thing that I noticed when I was cat-ing a bunch of byte strings to my screen was that some emoji (not all) were followed by either \"ef b8 8e\" or \"ef b8 8f.\" I felt sad. Had I totally failed to understand how emoji work on Twitter? Was there something I was missing?\n",
    "\n",
    "The answer is no, not really. Those pesky multibyte charaters are non-display characters called \"variation selectors (http://unicode.org/charts/PDF/UFE00.pdf),\" and the change how emoji are displayed. There are lots of variation selectors (16, I think), but two apply to emoji, and they correspond to \"\\xef\\xb8\\x8e, or text style\" and \"\\xef\\xb8\\x8f, or emoji style\" display of the emoji characters,to allow for even more variety in a world that already allows for a normal hotel (🏨) and a \"love hotel\" (🏩).\n",
    "\n",
    "Not all emoji have variants for the variation selectors, nor do all platforms bother trying to deal with them, but Twitter does. If you ever find yourself in a position where you care, here's a quick example of what they do.\n",
    "\n",
    "You will need to open a terminal, because I couldn't find a character that would display in-notebook as both text style and emoji style.\n",
    "\n",
    "<pre><code>\n",
    "printf \"\\xE2\\x8C\\x9A\"\n",
    "printf \"\\xE2\\x8C\\x9A\\xef\\xb8\\x8e\"\n",
    "printf \"\\xE2\\x8C\\x9A\\xef\\xb8\\x8f\"\n",
    "</code></pre>\n",
    "\n",
    "Takeaway: Variation selectors are the difference between an Apple Watch and a Timex."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Python functions for dealing with data representations\n",
    "Some of the built-in functions that I used to manipulate binary/hex/decimal representations here:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Shoutout to Josh's RST!\n",
    "def print_output(function,input_data,kwargs={}):\n",
    "    kwargs_repr = \",\".join([\"=\".join([x[0], str(x[1])]) for x in kwargs.items()])\n",
    "    print \"{}({},{}) -> {}\".format(function.__name__, repr(input_data), kwargs_repr,\n",
    "                                    repr(function(input_data,**kwargs)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Decimal to hex:\n",
    "print \"Converting decimal to hex string:\"\n",
    "print_output(hex,240)\n",
    "# hex to decimal\n",
    "print \"\\nConverting hex to decimal:\"\n",
    "print_output(int,hex(240),kwargs = {\"base\":16})\n",
    "# decimal to binary\n",
    "print \"\\nConverting decimal to binary:\"\n",
    "print_output(bin,240)\n",
    "# binary string to an integer\n",
    "print \"\\nConverting decimal to binary:\"\n",
    "print_output(int,\"11110000\",kwargs = {\"base\":2})\n",
    "# byte string representation to ordinal (unicode code point value)\n",
    "print \"\\nConverting byte string to ordinal\"\n",
    "print_output(ord,\"\\x31\")\n",
    "print_output(ord,\"\\xF0\")\n",
    "# ordinal to unicode code point\n",
    "print \"\\nConverting ordinal number to unicode code point\"\n",
    "print_output(unichr,49)\n",
    "print_output(unichr,240)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
